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Abstract 

Here we propose an algorithm, named generalized orthogonal components regres- 
sion (GOCRE), to explore the relationship between a categorical outcome and a set 
of massive variables. A set of orthogonal components are sequentially constructed to 
account for the variation of the categorical outcome, and together build up a gener- 
alized linear model (GLM). This algorithm can be considered as an extension of the 
partial least squares (PLS) for GLMs, but overcomes several issues of existing exten- 
sions based on iteratively reweighted least squares (IRLS). First, existing extensions 
construct a different set of components at each iteration and thus cannot provide a 
convergent set of components. Second, existing extensions are computationally inten- 
sive because of repetitively constructing a full set of components. Third, although they 
pursue the convergence of regression coefficients, the resultant regression coefficients 
may still diverge especially when building logistic regression models. GOCRE instead 
sequentially builds up each orthogonal component upon convergent construction, and 
simultaneously regresses against these orthogonal components to fit the GLM. The 
performance of the new method is demonstrated by both simulation studies and a real 
data example. 

Key Words: categorical data; classification; collinear; dimension reduction; multi- 
collinear 



1 INTRODUCTION 

Available high-tlirougliput biotechnologies have made it possible to genotype thousands of 
genetic markers, meanwhile, they bring challenges to statistical analyses of these data. Such 
data are characterized by a large number of variables (p) observed from a relatively small 
number of subjects (n), and create the well-known large p small n problems. To deal with 
this issue, an important strategy is to reduce the high dimensionality of the predictors before 
fitting models. As a supervised dimension-reduction method, partial least squares (PLS) by 
Wold (1975) has drawn considerable attentions, see Vinzi et al. (2010). PLS constructs or- 
thogonal components such that these components capture information of original predictors 
predicting response variables, and linear models are built on the base of these components 
instead of the original predictors. It is computationally fast and able to take collinear or 
multicollinear predictors. 

Success of PLS in fitting linear models motivates extensions to generalized linear models 
(GLMs). With the iteratively reweighted least squares (IRLS) algorithm commonly used 
for building regular GLMs (Green 1984), Marx (1996) proposed an extension, i.e., the iter- 
atively reweighted partial least squares (IRPLS) algorithm, which replaces the least squares 
estimates with PLS estimates at each iteration. It is a natural extension of PLS, however, 
a different set of orthogonal components are constructed at each iteration and thus the con- 
vergence of original regression coefficients is pursued. As a result, the loadings of orthogonal 
components never converge, and even regression coefficients especially for logistic regressions 
rarely converge. A full set of distinct components at each iteration not only make it difficult 
to interpret, but also demand intensive computation. 

Much effort has been devoted to solving the non-convergence issue of IRPLS. Ding and 
Gentleman (2004) applied the bias reduction procedure proposed by Firth (1993) to IRPLS, 
specifically for the classification problems. Firth (1993) modified the score function to remove 
the first order term of the asymptotic bias of maximum likelihood estimators for GLMs. 
Heinze and Schemper (2002) showed that this bias reduction procedure may also avoid the 
common infinite estimate problem of logistic regressions. However, the non-convergence issue 
still exists in IRPLS by Ding and Gentleman (2004), possibly due to varying components 



at each iteration. Alternatively, Fort and Lambert-Lacroix (2005) proposed to build up 
continuous pseudo-responses via ridge regression and then apply PLS to regress these pseudo- 
responses against the predictors; Nguyen and Rocke (2002) instead proposed to first apply 
PLS by treating the responses as continuous, and then fit a regular GLM using the resultant 
orthogonal components instead of the original predictors. 

Here we propose a different strategy, namely, the generalized orthogonal components 
regression (GOCRE), to extend the supervised dimension reduction idea in PLS and fit high 
dimensional GLMs. While IRPLS repetitively constructs a different set of components at 
each iteration and targets a convergent set of regression coefficients, GOCRE sequentially 
constructs orthogonal components which maximally account for the remaining variation in 
the categorical outcome. The bias correction procedure by Firth (1993) is also applied. The 
proposed method enjoys computational privilege over IRPLS since IRPLS needs to rebuild 
all orthogonal components at each iteration. The construction of orthogonal components is 
also different from the methods by Fort and Lambert-Lacroix (2005) and Nguyen and Rocke 
(2002), both directly maximizing correlation between categorical responses and components. 

This paper is organized as follows. The next section introduces our proposed method 
in details. Simulation studies are shown in Section 3, and an application of the proposed 
method to a real dataset is presented in Section 4. We close the paper with a brief discussion. 

2 THE METHOD 

2.1 High Dimensional Generalized Linear Model 

Suppose the distribution of response F is a member of the exponential family distribution, 

/(|/|^) = exp|^^-^ + c(i/,0)|, (1) 

where 9 is the canonical parameter, and is the known dispersion parameter. A link function 
g{-) further relates the mean of response Y to the p predictors in X, i.e., 

^(E[r|X]) = /i + X/3, (2) 



where /x is the intercept, and /3 is a p-dimensional column vector containing all regression 
coefficients of the predictors. The inverse function of g{-) is denoted as g^^{-). 

With a size n sample {(yj,Xj),z = 1,2,- ■■ ,n}, a common issue is how to provide a 
legitimate estimate of /3 in ([2]) when p ^ n. Denote X = (x^, ■ ■ ■ ,x^)*, an n x p matrix 
with rank r^. < min{n,p). The classical maximum likelihood estimators (MLEs) of /3 form a 
space with dimension at least p — r^. Suppose an n x r^; matrix X^ is constructed by a subset 
of columns of X, and further assume that there is a unique maximum likelihood estimator 
of f3s for the following model, 

g{E[Y\Xs])=fi + Xs/3s. (3) 

Correspondingly there exists a unique MLE of (3, namely (3, in model (|2]), satisfying the 
following assumption. 

Assumption 1. P^ip = whenever X-?/) = 0„xi- 

In the case that X is of rank r^, the above assumption equivalently puts p — Tx constraints 
on MLE /3 to make model ([2]) identifiable. This assumption makes practical sense in solving 
the coUinearity or multicoUinearity issue. For example, if the j-th predictor consistently 
doubles the value of the fc-th predictor, we have /3j = 2/3^. Therefore, the scale of the 
predictor, if preserved, may indicate its importance. On the other hand, when the predictors 
are identical, the corresponding regression coefficients will also be identical. 

Due to the aforementioned multicoUinearity issue, we can focus on building model ([2]) 
with /3 satisfying the following assumption, a population version of Assumption 1. 

Assumption 2. /3V = whenever Xip = 0, a.s. 

In the next section, we consider the construction of the GOCRE model for any ran- 
dom pair {Y, X) from the population. GOCRE sequentially builds orthogonal components 
Xwj, j = 1, 2, ■ ■ ■ , to account for the variation of the categorical outcomes. For the same 
reason mentioned above, each vjj satisfies the following assumption, leading to (3 satisfying 
Assumption 2 when a full set of components are used to build model ([2]). 

Assumption 3. w^.ip = whenever Xip = 0, a.s. 



2.2 Generalized Orthogonal Components Regression (GOCRE) 

The orthogonal components will be sequentially constructed with a prespecified weight w for 
each pair of Y and X in the whole population such that E[w\ is finite. We further assume 
that i^fwX] = Op, where Op is a p-dimensional column vector with all components as zero. 
Note that such a weighted centralization of the random vector X plays an important role in 
carrying out GOCRE. Since GOCRE constructs orthogonal components relying on a linear 
regression model whose response value changes at each iteration, the intercept has to be 
updated at each iteration (unlike PLS which removes intercept from the regression model). 
This weighted centralization allows separate calculation of the intercept and orthogonal 
components. For convenience, we denote 'Wg^^{ri) = dg^^{ri)/dri in the following. 
First, let Xi = X and for a specific r], i.e., r] = r]^^\ we calculate 

Zir]) = r]+{Y- g-\r])} /Vg-\r^). (4) 

A component Xia{ri) can be constructed with a = a{ri) maximizing ||i?[Z(r7)u'XiQ;]|p under 
the condition ||q;|| = 1. With a scaler variable Z{ri), we indeed have 



a{r]) = E[XiwZir])]/\\E[XlwZi 

Then regressing Z = Zirf) against Xio; with a = a{ri) leads to an update of r], 

ri{a) = E[wZ]/E[w] + X^a-fi, (5) 

where 71 = E[a'''XlwZ]/E[a'''XlwXia]. Alternatively update a{ri) and ri{a) until a{ri) 
converges to ai, which leads to the construction of the first component Xiai. 

After constructing the j-th component Xjaj, we remove Xjaj from Xj such that Xj^i = 
Xj — Xjaj9j is orthogonal to Xjaj, i.e., 

E[X!j^^wXjaj] = 0^ej = E[a'jX'jWXj]/E[a'jX'jWXjaj]. (6) 



Since 



'i-i 
Xjaj = Xj_i{I - aj^i9j_i)aj = ■ ■ ■ = X <^ ]^(/ - aj„i9j_i) ^ aj, 

.1=1 



we have the following preposition. 



Preposition 1. Each component XjUj can be rewritten as Xwj where 

^i = j n^^ " '^3-1^3-1) \ «i- 

Furthermore, with the inner product defined as (x, y) = E[xwy\, the components Xwi^ 
Xzu2, ■ ■ ■ , are orthogonal. 

Through these first j orthogonal components, we can obtain an estimate of 77, say rij. 
Taking rj = rij, we calculate Z{ri) following (jl]). A component Xj+ia(?7) is then constructed 
with 

aiv) = ^Tg ms.x{\\E[Zi^)wX^^,a]f} = E[X'^^,wZ{v)]/\\E[X'^^,wZm\. (7) 

a:||a||=l 

Regressing Z = Zirf) against Xjj^iairf) as well as the first j components leads to an update 
of r/, 

j 
7]{a) = E[wZ]/E[w] + Y, Xkakik + ^i+i«7, (8) 

fc=i 

where 7 = E[a''Xj^-^wZ]/E[a''Xj^-^^wXjj^.ia], and 7^ = E[alXlwZ]/ E[alXlwXkak] for k = 
1, • • • ,j. Alternatively update 0(77) and ri{a) until a{ri) converges to a^+i, which leads to 
the construction of the (j + l)-st component Xj+iaj^i. 

Such construction stops whenever w^^'^Z^t]) is uncorrelated to w^^'^Xj^i. Upon comple- 
tion of the construction, w^^'^Xzui, w^^'^Xw2, w^^'^Xzu^, ■ ■ ■ , are uncorrelated, which lead to 
the generalized orthogonal-components regression model with orthogonal components Xwi, 
XVU2, Xw-i, ■ ■ ■ . 

Preposition 2. Upon completion of the construction, we can build up the generalized 
orthogonal-components regression model, 

g{E\Y\X\) = i,^Y.^,{Xw,;) , (9) 

3 

where Wj, j = 1,2, ■■■, are as specified in Preposition 1, and ^j, j = 1,2, ■■■, are the 

regression coefficients of the corresponding orthogonal components. Furthermore each Wj 

satisfies Assumption 3, and /3 = ^ • i!^jTUj satisfies Assumption 2. 
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Proof. When Xiip = Xijj = 0, a.s., alip = following ([7]). It leads to X2^ = 0, a.s.. 
(jip = 0, a.s., and a^jip = 0- Hence w^ji 



Iteratively we have Xjip = 0, a.s., and a^.ip = 0- Hence w^,ip = 0, j = 1, 2, ■ • • , which leads 



to /3V = 0. 

2.3 The Algorithm 

With observed data Y = (yi, ■ ■ ■ , ?/„)* and X = (x^, ■ ■ ■ , x^)*, we can follow the above idea 
to sequentially construct orthogonal components accounting for the variation in Y, and also 
provide an estimate of /3 satisfying Assumption 1. The construction proceeds on the basis of 
prespecified weight Wi for the i-th observation. We denote W = diag{wi, ■ ■ ■ , m„}. Without 
loss of generality, we further assume that Xi = X has been column- wisely centralized, i.e., 
X*W^1„ = Op, where 1„ is an n-dimensional column vector with all components as one. 

Suppose that components Xiai,--- ,Xj_iq;j_i have been constructed, rjj^i is output 
from the construction of the {j — l)-st component Xj_iq;j_i, and Xj is also constructed. We 
can therefore proceed to construct the j-th component X^aj, tjj, and X^+i as follows, 

1. Initialize rjj = r]j-i; 

2. Update Z = r/, + H-^{Y - 9-^7]^)}, with H = diag{Vg-\r]ji), " " " , ^g~\Vjn)h 

3. Update /i = l^iyZ/{l^iyi„}; 

4. Update aj = X.'-WZ/\\X.'-WZ\\; 

5. Update 7fc = aiXiWZ/{aiXlWXkak} ioik = !,■■■ , j; 

6. Update rjj = /il„ + Y.k=i ^kOiklk] 

7. Iterate between 2-6 until aj converges; 

8. Calculate Pj = a*X*W^Xj/{a*X*W^Xjaj}, and X^+i = X^- -XjajPj. 

Note that rij = (77^1, ■■ ■ ,rjjnY. In Step 2, we also abuse the notations by defining 



fc-1, 



Remark 1. For each k, X^ = 'Kk-i{Ip - ak-iPk-i) = ■ ■ ■ = X x []/=! {Ip - (^iPi 



therefore X^iyi„ = Op following X*iyi„ = Op. The weighted least squares estimation 
equation l^WZ = l^iy (yul„ + Yli=i Ik^k^k) leads to Step 3. 

Remark 2. Calculation of Pj in Step 8 implies that X*_^^iyXja;j = 0. Iteratively, it 
leads to X*_,_]^H^Xfcafc = for k = j, j — 1, ■ ■ ■ , 1. That is, the components Xiai, 'K.2a2, 
Xatts, ■ ■ ■ , are orthogonal when the inner product is defined as {x, y) = x^Wy. 

Remark 3. Step 5 follows the application of X|.P1/1„ = Op and a^X^VTXza; = 0, I j^ k, 
to the weighted least squares estimation equation a^XllVZ = Q;^X^14^(/il„ + XlLi li^i'^i)- 

Remark 4. From the construction of the last component, say ^m<^m, we have the 
estimate 

j=i U=i J 

which can be sequentially updated and satisfies Assumption 1. The parameter /i in the 
model ([2]) can be estimated by /i upon constructing the last component. 

Compared to the original model ([2]), the GOCRE model iQ not only present the unique 
MLE satisfying Assumption 1, but also calculate the MLE without computing inverse of 
any matrix. While both features are desirable in analyzing p ^ n data, the latter one 
particularly speeds up the calculation. 

2.4 Selection of Weights 

Note that, for a specific r], Z{ri) in (jl]) has the variance 

var{Z{r])) = h"{e)a{c^) I {Vg-\ii)Y . 

Therefore, it is preferred to have a dynamic weight w{ini) oc l/var{Z{ri)). However, such 
a dynamic weight makes it impossible to construct orthogonal components in any specific 
inner product space. 

One strategy is to take dynamic weights when the first component is being iteratively con- 
structed with the identity matrix as the initial value. Once the first component is constructed, 
we have a converged weight matrix, and therefore use this weight matrix for constructing all 
subsequent components. 

9 



Another strategy is to run the aforementioned algorithm twice. The first run of the 
algorithm may take an identity weight matrix or use the above strategy to construct a 
weight matrix. The second run can construct a weight matrix based on the r] value from the 
last step of the previous run. 

Our simulation study demonstrated that the first strategy usually performs well and 
there is negligible gain in taking a second run of the algorithm (results not shown). 

2.5 Convergence Failures and Bias Correction 

For some GLMs, especially the logistic regression model for binary responses, MLE may 
not exist due to complete separation, or quasicomplete separation of different categories 
(Albert and Anderson 1984), and it usually results in non- convergence of the corresponding 
algorithm. Heinze and Schemper (2002) proposed that the penalized likelihood method by 
Firth (1993) can solve the convergence problem due to the aforementioned separation issue. 
Suppose the model in ([2]) has log-likelihood i{fi, (3) and information matrix I{fi,l3). In- 
stead of directly maximizing the log-likelihood function. Firth (1993) proposed to maximize 
the penalized log-likelihood 

r(/x,/3) = £(/x,/3) + ilog{|/(/i,/3)|}, 

where the penalty corresponds to the Jeffreys invariant prior (Jeffreys 1946). Firth (1993) 
initially took this modification to reduce the bias of maximum likelihood estimates, and 
showed that the first order bias can be removed. 

For logistic regression, we will modify our algorithm using the same idea to reduce the 
bias and solve the non- convergence issue. Note that only Step 2 of the algorithm in Section 
2.3 needs to be modified. Define 

A = (40nxn = W-2X{X'WX)+X'W-2, 

C = ((5ii,--- ,6nn)\ 

A = diag{6ii, ■ ■ ■ , 6nn}, 

where (X^WX)"*" is a Moore-Penrose pseudo- inverse. We then replace Step 2 with the fol- 
lowing steps, 
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2a. Calculate H = diag{{l + 5ii)Vg ^(%i), ■ ■ ■ , (1 + Snn)'^g ^{Vjn)}'-, 
2b. Update Z = r/,- + R-^Y + \C - (/„ + K)g~\r],)}. 

Note that, once the weight matrix W is fixed, we then have fixed values of 5kk-, k = 
1, ■ ■ ■ ,n. Therefore, unlike the IRPLS method modified by Ding and Gentleman (2004) 
which requires re-calculation of the weight matrix at each iteration, we do not need this 
re-calculation when constructing all components other than the first component, because of 
the fixed weight matrix W. Since high-dimensional data imply large matrices involved in cal- 
culation of the weight matrix, our algorithm can be more efficient in terms of computational 
cost. 

We use GOCREq to refer to the above implementation for bias correction. As shown in 
the following, the calculation of A can be simplified through a singular value decomposition, 
which will essentially speed up the computation of GOCREq. 

For large n, it is still computationally intensive to calculate and maintain the matrix 
A. Instead, Chung and Keles (2010) approximate each diagonal component of A with 
trace{A}/n. A similar strategy has been utilized in constructing generalized cross-validation 
(Golub et al. 1979). Observing that A is usually of full rank and therefore trace{A}/n = 1 
when p ^ n, Chung and Keles (2010) always take ( = In and A = /„. Indeed, as shown 
below, A = J„ when p > n and X is of full rank. 

Preposition 3. Assume rank(X.) = k and therefore the singular value decomposition W^^/^X = 
UQV^ where U is n x k with U^U = Ik, ^ is a k x k diagonal matrix with positive diag- 
onal elements, and V is p x k with V'^V = Ik- Then A = f/f/*. Furthermore, if A; = n, 
then A = /„; if A; = n - 1 and l^W^X = 0^, then A = /„ - W^/HnliW^/y\\W\\i, where 

As we preprocess X such that l^H^X = 0^, A will usually have the rank of n — 1 instead 
of full rank when p ^ n. The above preposition implies that ( = In — Wln/\\W\\i and 
A = /„ — W^/||W^||i. Hereafter, we will use GOCRE to refer to such an implementation, that 
is, taking 6kk = I - Wk/ Ya=i Wi, k = 1,2,- ■■ ,n. 
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3 SIMULATION STUDIES 

We simulated large p small n data to evaluate the performance of GOCRE and compare 
it with IRPLS implemented by Marx (1996) and Ding and Gentleman (2004), which are 
hereafter denoted by IRPLS-M and IRPLS-DG respectively. The underlying models take 
the logit link function in ([2]) with fi = and p = 1000. The predictors were divided into 
ten blocks, where each block was simulated from an AR{1) process with the correlation p 
prespecified at p = 0, 0.3, 0.5, and 0.7 respectively. The regression coefficients {f3j, 1 < 
j ^ p}, were generated from a Laplace distribution with location parameter two and scale 
parameter one. 

For each different p, the simulated data consist of a training set, an independent validation 
set and an independent test set. Each method was used to fit the models using the training 
data, and the optimal number of components was chosen using the validation data. The 
maximum number of components is ten for all methods. The performance was evaluated 
based on the misclassification rate (MR) and sum of squares of the prediction residuals 
(PRESS) calculated from the test data. We simulated 100 data sets, each consisting of the 
training, validation, and test set with sample size being 100, 100, and 200, respectively. 

Shown in Table [1] are the frequencies of each method which has converged in analyzing 
100 simulated data sets with each specific p. It is well known that there is a divergence 
problem for both IRPLS implementations (Ding and Gentleman 2004; Fort and Lambert- 
Lacroix 2005; Boulesteix and Strimmer 2006; Chung and Keles 2010). Indeed, the IRPLS-M 
did not converge in analyzing any of the simulated data sets. IRPLS-DG partially solved 
this problem through Firth's procedure. However, it still did not converge in analyzing, for 
example, 23% of the data sets with p = 0.5. On the other hand, both GOCREq and GOCRE 
converged in all data analyses. 

The MR and PRESS in analyzing different models are shown in Table [2] for all methods. 
We observe that, for each method, the higher the correlation among the predictors, the lower 
MR and PRESS. For either GOCREq or GOCRE, bold MR and PRESS values indicate better 
performance than IRPLS-M as well as IRPLS-DG. Interestingly, IRPLS-DG reported smaller 
MR than IRPLS-M except for the case of p = 0.3, while IRPLS-DG always reported smaller 
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Table 1: Convergence Frequencies of Different Methods in Analyzing Simulated Data. 



Methods 


p = 0.0 


p = 0.3 


/9 = 0.5 


p = 0.7 


IRPLS-M 


0% 


0% 


0% 


0% 


IRPLS-DG 


79% 


82% 


77% 


94% 


GOCREo 


100% 


100% 


100% 


100% 


GOCRE 


100% 


100% 


100% 


100% 



PRESS than IRPLS-M except for the case of p = 0.7. In all cases, GOCREq performed better 
than both IRPLS methods in terms of either criterion, except that IRPLS-M reported the 
smallest PRESS when p = 0.7. Indeed, GOCRE reported larger MR than IRPLS-DG only 
in the case of p = 0.0, but performed better than both IRPLS methods in all other cases. 
In addition to their competitive performance and solving the convergence issue, GOCREq 
and GOCRE also enjoy advantage over the other two methods in computing time which 
is a critical issue in analyzing high- dimensional data. As shown in the next section, both 
GOCREq and GOCRE can significantly reduce the computing time. 

Table 2: Performance Comparison in Analyzing Simulated Data. Reported are the me- 
dian MR and PRESS across 100 simulated data sets, with standard errors presented in the 
parentheses. 



Criterion 


Model 


IRPLS-M 


IRPLS-DG 


GOCREq 


GOCRE 


MR 


p = 0.0 


.4350(.0313) 


.4250(.0349) 


.4250(.0330) 


.4275(.0332) 




p = 0.3 


.3900(.0350) 


.3950(.0364) 


.3825(.0365) 


.3850(.0365) 




p = 0.5 


.3525(.0378) 


.3450(.0337) 


.3350(.0336) 


.3350(.0336) 




p = 0.7 


.3050(.0337) 


.2900(.0310) 


.2850(.0347) 


.2850(.0346) 


PRESS 


p = 0.0 


.2671(.0169) 


.2414(.0057) 


.24O5(.0058) 


.2405(.0058) 




p = 0.3 


.2475(.0217) 


.2330(.0064) 


.2313(.0066) 


.2312(.0067) 




p = 0.5 


.2290(.0195) 


.2223(.0069) 


.2208(.0072) 


.2207(.0072) 




p = 0.7 


.2004(.0185) 


.2034(.0073) 


.2034(.0084) 


.2033(.0085) 



As shown in Preposition 3, GOCREq and GOCRE should report exactly the same results 
since we preprocessed X in simulated data such that 1„W^X = 0^. However, 1„W^X = 0^ 
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could not be computationally obtained due to the computer precision. For example, our 
computation in MATLAB would return centralized X with column means in the scale of 
10^^^ instead of exact zero, which resulted in the slight difference between GOCREq and 
GOCRE. 

4 APPLICATION TO GENE EXPRESSION PROFIL- 
ING 

Here we use the lung cancer data set (Gustafson et al., 2010) from Gene Expression Omnibus 



(GEO; http://www.ncbi.nlm.nih.gov/geo/) to illustrate the performance of GOCRE when 



compared to the two different IRPLS implementations. In this data set, a total of 187 arrays 
were used to monitor the expression levels of 22,215 genes from 97 cancer patients and 90 
healthy individuals. For each gene, a p- value was obtained from the Wilcoxon rank-sum test. 
Ranking the genes ascendingly on the basis of their p- values, we constructed four data sets 
by including the top 1000, 2000, 5000, and all genes respectively. 

For each data set, we randomly selected one-quarter of the samples to form the test data 
(with 24 cancer patients and 23 normal persons) and used the rest as training data. We 
applied each method to build up the models with different number of components (up to 20 
components) using the training data, and then calculated MR and PRESS of each model 
based on the test data. The results are shown in Figure [T] and Figure [2J 

As shown in Figure [T|, IRPLS-M did not converge in constructing almost every compo- 
nent. Although IRPLS-DG converged in constructing majority of the components, it did 
not converge in constructing some components in analyzing each of the four data sets. As 
expected, both GOCREq and GOCRE performed similarly, and converged in constructing 
each component. Except for the case with p = 5000 where GOCREq and GOCRE might 
obtain smaller MR than other methods, all methods obtained the same smallest MR in other 
cases. 

Similar to the performance in terms of MR in Figured], IRPLS-M presented rather wildly 
varying PRESS when different number of components were considered. The other three 
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(a) p = 1000 



(b) p = 2000 
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Figure 1: Misclassification Rates (MR) in Analyzing the Lung Cancer Data. The results were 
obtained using IRPLS-M (dotted lines), IRPLS-DG (dashed dotted lines), GOCREq (dashed 
lines), and GOCRE (solid lines) respectively to build up models with different number of 
components (/t). Non-converged methods were marked by diamonds for IRPLS-M, and 
circles for IRPLS-DG. 



15 



(a) p = 1000 



(b) p = 2000 
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Figure 2: Sum of Squares of the Prediction Residuals (PRESS) in Analyzing the Lung 
Cancer Data. The results were obtained using IRPLS-M (dotted lines), IRPLS-DG (dashed 
dotted lines), GOCREq (dashed lines), and GOCRE (solid lines) respectively. Non-converged 
methods were marked by diamonds for IRPLS-M, and circles for IRPLS-DG. 
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methods instead presented very stable and similar PRESS while the PRESS of IRPLS-DG 
are slightly smaller than those of GOCREq and GOCRE. Note that the slight advantage 
of IRPLS-DG in PRESS did not imply improvement in MR. Indeed, all methods except 
IRPLS-M presented very similar MR for most models. 

Not surprisingly, both GOCREq and GOCRE were much faster than the other two meth- 
ods because the IRPLS implementations need to reconstruct a set of components at each 
iteration, but GOCREq and GOCRE sequentially construct all components. As shown in 
Table |3l they took much less computing time than the other two methods when analyzing 
the lung cancer data using GLMs with different number of components (up to 20 compo- 
nents). Indeed, for each GLM model with k components, both IRPLS implementations need 
to update the predictor matrix k times within each iteration. For large p small n data set 
which produces a high dimensional predictor matrix, it is time consuming to update the 
predictor matrix. 

Table 3: Computation Time (Seconds) in Analyzing the Lung Cancer Data. 



p 


IRPLS-M 


IRPLS-DG 


GOCREq 


GOCRE 


1000 


2,384 


263 


8 


7 


2000 


5,334 


609 


15 


12 


5000 


11,750 


819 


35 


28 


22215 


48,531 


2,972 


443 


370 



5 DISCUSSION 

Zhang et al. (2009) proposed an orthogonal-component regression (OCRE) for supervised 
construction of principal components which account for the variation of continuous responses. 
OCRE can be considered as an alternative implementation of PLS. We here propose GOCRE 
which extends OCRE for GLMs, focusing on binary outcomes. Such an extension makes it 
feasible to extend POCRE in Zhang et al. (2009) for GLMs, allowing to select variables from 
a large amount of candidates. However, the need of iterated procedures like IRLS for fitting 
classical GLMs challenges the extension. One challenge follows the use of weighted linear 
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models to construct the components. Available components would suggest a weighted linear 
model with updated weights when constructing a new component. Indeed, any update of 
the same component in each iteration would suggest a new set of weights. Here we suggest 
to fix the weights upon the construction of the first component, which usually provide a 
satisfactory set of orthogonal components. A fixed set of weights can also be suggested using 
some other preliminary analysis of the data. 

GOCRE is proposed to sequentially construct a set of orthogonal components and it 
allows to investigate the relationship between a categorical outcome and a set of variables of 
interest. Cross validation can be used to determine the number of orthogonal components 
when it is targeted to fit the underlying GLM. We may be interested in a set of orthogonal 
components which may take account of the variation in the categorical outcome. Therefore, 
the coefficient of determination as defined by Nagelkerke (1991) may be employed. An 
entropy measure may also be explored for such a purpose. 

APPENDIX: PROOF OF PREPOSITION 3 

Expand VtoV = (y Vc) such that V^V = Ip, and further expand U to U such that U^U = Ip 
in the following way. 



U = < 



Accordingly, we expand A to a p x p matrix A as follows. 





Let X^ = l^i/^x and X^ = UAV\ then X^X^ = X^X^ = VA'^V\ which implies that 



(X^X^)+ = VA+^V\ and 




It follows that X^(X^X^)+X^ = UKV'vK+'^V'V kU' = UUK That is, A = UU'. 

When k = n, U is an orthonormal matrix which implies that A = f/f/* = J„. 

When A; = n - 1 and l^W^X = {W^/HnY^S.^ = 0^, then U = {U W^^Hn/^/Wh) is 
an orthonormal matrix. That is, J„ = t/f/* = f/f/* + W^^'^lnllW^^'^ /\\W\\i, which implies 

A = f/f/* = /„ - w^/Hniiw^^y\\w\\i. 
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